Project: Explore Braun dataset¶
Upstream Steps
- Assemble adata
- Filtered to select cortex, cerebellum, thalamus
- Sub-sampled to 75000 cells
This notebook
- QC filter on cells
- Expression filter on genes
- Normalization and log1p transformation by Scanpy functions
- Feature selection (HVG) by Scanpy functions
- Dimensionality reduction
- Batch correction by Harmony
Dataset Information¶
References¶
- Paper Reference: Comprehensive cell atlas of the first-trimester developing human brain
- Data and code repository: GitHub Repo
Short description: 26 brain specimens spanning 5 to 14 postconceptional weeks (pcw) were dissected into 111 distinct biological samples. Each sample was subjected to single-cell RNA sequencing, resulting in 1,665,937 high-quality single-cell transcriptomes.
A subsample of the dataset was selected covering cortex, cerebellum and thalamus, containing the following cell classes:
- Erythrocyte
- Fibroblast
- Glioblast
- Immune
- Neural crest
- Neuroblast
- Neuron
- Neuronal IPC
- Oligo
- Placodes
- Radial glia
- Vascular
1. Environment¶
import os
import sys
import random
import warnings
import itertools

import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import scipy.sparse as sp
from scipy import stats
import statsmodels.api as sm
from sklearn.cluster import KMeans
import yaml

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio

warnings.filterwarnings('ignore')
pio.renderers.default = "jupyterlab"
random.seed(1)
homeDir = os.getenv("HOME")
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
# import user-defined functions from the utils folder
sys.path.insert(1, "../2_Day2/utils")
from CleanAdata import *
from SankeyOBS import *
scanpy==1.10.1 anndata==0.10.7 umap==0.5.6 numpy==1.26.4 scipy==1.11.4 pandas==2.2.2 scikit-learn==1.4.2 statsmodels==0.14.2 igraph==0.11.5 pynndescent==0.5.12
1.1 Input file¶
You can set parameters useful for your analysis in the cells below and recall them later during the analysis.
path = '../DataDir/InputData/'
Id = 'Project_FiltNormAdata.h5ad'
input_file = path + Id
1.2 Data Load¶
adata = sc.read_h5ad(input_file)
adata.var_names_make_unique()
2. Additional QCs¶
Here we perform a few additional inspections to ensure that technical issues are not carried over to the next steps.
First, we recompute the quality metrics:
# mitochondrial genes, "MT-" for human, "Mt-" for mouse
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# hemoglobin genes
adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]")
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True
)
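As a sanity check on what `sc.pp.calculate_qc_metrics` reports, the mitochondrial percentage can be reproduced by hand. A minimal sketch on a toy counts matrix (gene names and counts invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy counts: 3 cells x 4 genes, the first two flagged as mitochondrial
counts = np.array([[10, 0, 5, 5],
                   [2, 2, 3, 3],
                   [0, 8, 1, 1]])
genes = pd.Index(["MT-CO1", "MT-ND1", "GAPDH", "ACTB"])
mt = genes.str.startswith("MT-")

total_counts = counts.sum(axis=1)
# Percentage of each cell's counts falling in MT- genes (pct_counts_mt)
pct_counts_mt = 100 * counts[:, mt].sum(axis=1) / total_counts
print(pct_counts_mt)
```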
# Reformat the Donor and Week columns
adata.obs["Og_Donor"] = adata.obs["Og_Donor"].str.replace(":","")
adata.obs["Og_Week"] = adata.obs["Og_Week"].astype("string")
sc.pl.pca(adata, color=["Og_Donor", "Og_Week","Og_Chemistry","Og_Subregion"], ncols=4, wspace=.5, size = 50, vmin='p1', vmax='p99')
2.1 Batches inspection¶
adata.obs
| Og_Subregion | Og_Tissue | Og_TopLevelCluster_x | Og_dissection | Og_Donor | Og_Age | Og_Week | Og_Chemistry | run_id | sample_id | ... | pct_counts_in_top_50_genes | pct_counts_in_top_100_genes | pct_counts_in_top_200_genes | pct_counts_in_top_500_genes | total_counts_mt | log1p_total_counts_mt | pct_counts_mt | total_counts_hb | log1p_total_counts_hb | pct_counts_hb | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CellID_Index | |||||||||||||||||||||
| b'10X208_6:ACGGTCGCAGTCAGTT' | Cortex | Occipital_cortex | 2 | Occipital_cortex | XDD358 | 11.5 | 11 | v3 | NoInfo | 10X208_6 | ... | 4.429574 | 8.076678 | 13.885225 | 26.650324 | 22.181249 | 3.143344 | 0.541272 | 1.102416 | 0.743087 | 0.026901 |
| b'10X212_6:GTGCTGGTCTTTCAGT' | Cortex | Lower_cortex | 0 | Lower_cortex | XDD359 | 13.0 | 13 | v3 | NoInfo | 10X212_6 | ... | 4.607994 | 8.334671 | 14.549377 | 28.904248 | 25.100336 | 3.261948 | 0.646785 | 1.425516 | 0.886044 | 0.036733 |
| b'10X89_6:ACTTTCACAAACTGCT' | Thalamus | Thalamus | 13 | Thalamus | BRC2006 | 8.0 | 8 | v2 | NoInfo | 10X89_6 | ... | 6.697738 | 12.189211 | 21.691530 | 44.923577 | 23.801810 | 3.210917 | 0.891311 | 1.789264 | 1.025778 | 0.067003 |
| b'10X199_8:CTAAGTGAGCTATCTG' | Thalamus | Thalamus | 13 | Thalamus | BRC2191 | 6.9 | 6 | v3 | NoInfo | 10X199_8 | ... | 4.537666 | 8.189201 | 14.390066 | 29.213019 | 29.283055 | 3.410588 | 0.768188 | 0.933989 | 0.659585 | 0.024502 |
| b'10X199_5:TACCTGCGTAAGATTG' | Thalamus | Thalamus | 34 | Thalamus | BRC2191 | 6.9 | 6 | v3 | NoInfo | 10X199_5 | ... | 5.401125 | 9.620665 | 16.112945 | 30.466863 | 31.569998 | 3.483392 | 0.877779 | 0.000000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| b'10X156_1:CAAGAAAGTCTAGTGT' | Cerebellum | Cerebellum | 9 | Cerebellum | XDD334 | 8.0 | 8 | v2 | NoInfo | 10X156_1 | ... | 5.977457 | 10.746987 | 18.915256 | 39.211040 | 11.394504 | 2.517253 | 0.376367 | 0.000000 | 0.000000 | 0.000000 |
| b'10X254_7:CACTAAGTCGTCAGAT' | Thalamus | Thalamus | 38 | Thalamus | XDD385 | 14.0 | 14 | v3 | NoInfo | 10X254_7 | ... | 4.178680 | 7.443501 | 13.039368 | 26.484720 | 33.267573 | 3.534200 | 0.827652 | 1.797263 | 1.028642 | 0.044713 |
| b'10X102_4:GCGCAGTCATTTCACT' | Cortex | Cortex | 28 | Cortex | XHU297 | 10.0 | 10 | v2 | NoInfo | 10X102_4 | ... | 6.114603 | 10.842398 | 17.966865 | 34.507028 | 21.706862 | 3.122667 | 0.670763 | 1.065353 | 0.725301 | 0.032920 |
| b'10X104_6:GGCAATTCAATACGCT' | Thalamus | Thalamus | 13 | Thalamus | BRC2057 | 8.1 | 8 | v2 | NoInfo | 10X104_6 | ... | 5.582894 | 9.874799 | 17.299339 | 36.175986 | 18.719663 | 2.981616 | 0.589286 | 0.000000 | 0.000000 | 0.000000 |
| b'10X109_3:AACTTTCTCCAGAAGG' | Cerebellum | Cerebellum | 32 | Cerebellum | BRC2073 | 6.6 | 6 | v2 | NoInfo | 10X109_3 | ... | 6.175726 | 11.201448 | 19.565389 | 40.368541 | 15.931441 | 2.829172 | 0.535812 | 1.454401 | 0.897883 | 0.048915 |
66567 rows × 35 columns
plotSankey(adata, covs=["Og_Donor","Og_Subregion","Og_Chemistry"])
2.2 Multiplet removal via scDblFinder¶
import anndata2ri
import rpy2.rinterface_lib.callbacks
import logging
anndata2ri.activate()
%load_ext rpy2.ipython
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)
import os
from rpy2 import robjects
# Set R_HOME and R_LIBS_USER (R_HOME must point to the R home directory, not the executable)
os.environ['R_HOME'] = '/opt/R/4.3.1/lib/R'
os.environ['R_LIBS_USER'] = '/opt/R/4.3.1/lib/R/library'
custom_lib_paths = "/opt/R/4.3.1/lib/R/library"
robjects.r(f'.libPaths(c("{custom_lib_paths}", .libPaths()))')
# Clean anndata from unnecessary fields
if os.path.isfile("./utils/DoubletCLass.tsv"):
print("Loading precomputed dbls class")
dblfinderClass = pd.read_csv("./utils/DoubletCLass.tsv", sep="\t", index_col=0)
else:
dblfinderClass = pd.DataFrame()
for sample in adata.obs["sample_id"].unique():
print(sample)
sce = adata[adata.obs["sample_id"] == sample].copy()
sce.obs = sce.obs[["sample_id"]]
del sce.obsp
del sce.obsm
del sce.varm
del sce.uns
sce.layers["counts"] = sce.layers["counts"].astype(np.int32).todense()
sce.X = sce.layers["counts"].copy()
sce.var.index.name = None
sce.var = sce.var[["EnsembleCode"]]
# Run doublet detection
sce = anndata2ri.py2rpy(sce)
print(sce)
scDblFinder = rpy2.robjects.packages.importr('scDblFinder')
S4Vectors = rpy2.robjects.packages.importr('S4Vectors')
as_data_frame = rpy2.robjects.r['as.data.frame']
sce = scDblFinder.scDblFinder(sce)
dblfinderClass = pd.concat([dblfinderClass,sce.obs[["scDblFinder.class"]]])
#SingletsList.extend(sce.obs["scDblFinder.class"].tolist())
del sce
dblfinderClass["scDblFinder.class"].to_csv("./utils/DoubletCLass.tsv", sep="\t")
# Save doublets info column
#sce.obs["scDblFinder.class"].to_csv("...", sep="\t")
Loading precomputed dbls class
adata.obs = pd.concat([adata.obs, dblfinderClass], axis = 1).loc[adata.obs.index]
adata = adata[adata.obs["scDblFinder.class"] == "singlet"]
2.3 Additional filtering¶
For the downstream analysis we want to be more conservative, so we inspect the relationship between the percentage of mitochondrial counts and the number of genes detected in each cell to identify outliers:
# refine filtering
sc.pl.scatter(adata, x="pct_counts_mt",y="n_genes_by_counts", size=40, color="pct_counts_ribo")
sc.pl.pca(adata, color=["Og_Donor","Og_Subregion","Og_Chemistry","cell_class","Og_Week"])
We can see that cells with a low number of genes have a higher percentage of mitochondrial and ribosomal counts, so we filter those out:
adata = adata[adata.obs["n_genes_by_counts"] >= 2000]
2.4 PCA regressors to check variance associated with covariates¶
A useful assessment is to quantify how much of the variability captured by the PCA is explained by the covariates ("Og_Donor", "Og_Subregion", "Og_Chemistry", "cell_class", "Og_Week"). We fit an Ordinary Least Squares regression of the principal component (PC) embeddings on each covariate and plot the embeddings grouped by covariate value. This shows whether, for a given PC and covariate, the embeddings differ across the covariate's values, highlighting covariates and batches that may impact that PC, and answers the question: how much do technical and biological factors guide the dimensionality reduction?
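The idea can be sketched on synthetic data (all values invented; `batch` stands in for any categorical covariate): regress a PC embedding on the covariate and read off the adjusted R².

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
# Toy "PC1" embedding driven partly by a two-level batch covariate
batch = np.repeat(["A", "B"], 100)
pc1 = rng.normal(0, 1, 200) + np.where(batch == "A", 0.0, 2.0)
df = pd.DataFrame({"value": pc1, "batch": batch})

# Adjusted R^2 estimates the fraction of PC1 variance explained by the covariate
fit = ols("value ~ C(batch)", data=df).fit()
print(round(fit.rsquared_adj, 2))
```

With a shift of 2 against unit noise, roughly half of the PC1 variance is attributable to the batch; a covariate with no effect would give an adjusted R² near zero.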
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
def plotResiduals(adata, covToTest, npcs):
PCRegrDict = {}
sns.set_theme(style="ticks")
varianceSummary = pd.DataFrame()
for i in range(npcs):
for n,c in enumerate(covToTest):
# format the data
Dummies = pd.get_dummies(adata.obs[c])
PCRegr = (Dummies.T * adata.obsm["X_pca"][:,i].T).T.melt()
PCRegr = PCRegr[PCRegr["value"] != 0]
PCRegr["variable"] = PCRegr["variable"].astype("category")
PCRegr["cov"] = c
PCRegrDict["PC:{}_COV:{}".format(i+1,c)] = PCRegr.copy()
sns.despine(offset=30)
PCRegrModel = pd.get_dummies(PCRegrDict["PC:{}_COV:{}".format(i+1,c)], "variable").copy()
# define the regression formula
formula = "".join(["value ~ "," + ".join(["C("+c+")" for c in PCRegrModel.columns[1:-1].tolist()])])
# fit the regression
fit = ols(formula, data=PCRegrModel).fit()
# fit.rsquared_adj
# get the residuals
PCRegr["rsquared_adj"] = fit.rsquared_adj
PCRegr["PC"] = i
varianceSummary = pd.concat([varianceSummary,PCRegr ], ignore_index=True)
CovOrder = {i:varianceSummary.drop_duplicates(["PC","cov"])[varianceSummary.drop_duplicates(["PC","cov"])["PC"] == i].sort_values("rsquared_adj", ascending=False)["cov"].tolist() for i in varianceSummary["PC"].unique().tolist()}
for i in range(npcs):
fig, axes = plt.subplots(1,len(covToTest), figsize=(40,4), dpi=300, sharex=False,sharey=False)
plt.subplots_adjust(wspace=.5)
for n,c in enumerate(CovOrder[i]):
PCRegr = varianceSummary[(varianceSummary["PC"] == i) & (varianceSummary["cov"] == c)]
sns.boxplot(data=PCRegr, x="variable", y="value", ax = axes[n],whis=[0, 100],
palette={i:adata.uns[c+"_colors"][adata.obs[c].cat.categories.tolist().index(i)] for i in PCRegr["variable"].unique().tolist()}
#order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()],
#hue_order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()])
)
sns.stripplot(data=PCRegr, x="variable", y="value", size=4, color=".3",ax = axes[n],
#order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()],
#hue_order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()])
)
axes[n].title.set_text('Covariate:{}'.format(c))
axes[n].set_xlabel(c, fontsize=20)
axes[n].set_ylabel("PC{} embeddings".format(i+1), fontsize=20)
sns.despine(offset=30)
axes[n].text(0.5, 0, "rsquared_adj:{}".format(round(PCRegr["rsquared_adj"].values[0], 2)), horizontalalignment='center', verticalalignment='center', transform=axes[n].transAxes)
axes[n].tick_params(axis="x", rotation=45)
plt.xticks(rotation=45)
fig.autofmt_xdate(rotation=45)
npcs = 6
covToTest = ["Og_Donor","Og_Subregion","Og_Chemistry","cell_class","Og_Week"]
plotResiduals(adata, covToTest, npcs)
3. This dataset is even larger, so metacells could again be a good idea¶
3.1 Data preparation¶
BadCelltypes = ["Immune_Cortex_Pre",
"Vascular_Cortex_Pre",
"Immune_Cerebellum_Pre",
"Vascular_Cerebellum_Pre",
"Vascular_Thalamus_Pre",
"Fibroblast_Cortex_Pre",
"Fibroblast_Cerebellum_Pre",
"Immune_Thalamus_Pre",
"Erythrocyte_Cortex_Pre"]
BadCelltypes = '|'.join(BadCelltypes)
adata = adata[~adata.obs.cell_class.str.contains(BadCelltypes)]
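Joining the labels with `"|"` builds a regex that matches any of them, so `str.contains` flags every cell whose class contains one of the unwanted substrings. A minimal sketch with invented labels:

```python
import pandas as pd

# "|"-joined labels act as an OR-regex; contains() does substring matching
bad = ["Immune", "Vascular"]
pattern = "|".join(bad)
cells = pd.Series(["Immune_Cortex_Pre", "Neuron_Cortex_Pre", "Vascular_Thalamus_Pre"])
kept = cells[~cells.str.contains(pattern)]
print(kept.tolist())
```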
3.2 Metacells definition and aggregation¶
adata.obs["Donor_region"] = adata.obs["Og_Donor"].astype("str")+"_"+adata.obs["Og_Subregion"].astype("str")
NewAdataList = []
import os
os.environ["OMP_NUM_THREADS"] = "4"
n_neighbors = 30
n_pcs = 8
n_top_genes = 2000
n_clusters = 200
grouping_obs = "Donor_region"
#######################################
if not os.path.exists("./utils/kmeansAggregation/"):
os.makedirs("./utils/kmeansAggregation/")
for sample in adata.obs[grouping_obs].unique().tolist():
print(sample)
adataLocal = adata[adata.obs[grouping_obs] == sample].copy()
if adataLocal.shape[0] <= 400:
continue
ncells = adataLocal.shape[0]
filenameBackup = "./utils/kmeansAggregation/Kmeans{}.{}.{}topgenes.{}neighbs.{}pcs.{}Ks.csv".format(ncells,sample,n_top_genes,n_neighbors,n_pcs,n_clusters )
# Check if the kmeans was precomputed
if os.path.isfile(filenameBackup):
print("Retrieving precomputed kmeans classes \n")
KmeansOBS = pd.read_csv(filenameBackup, sep="\t", index_col=0)
adataLocal.obs['kmeans_clusters'] = KmeansOBS
else:
# If not, run the k-means metacell aggregation
sc.pp.highly_variable_genes(adataLocal, n_top_genes=n_top_genes, flavor="seurat")
sc.tl.pca(adataLocal)
# Step 1: Compute the KNN graph
sc.pp.neighbors(adataLocal, n_neighbors = n_neighbors , transformer='pynndescent', n_pcs=n_pcs, metric="euclidean") # Adjust n_neighbors as needed
# Step 2: Extract the KNN distance matrix from AnnData
connectivities = adataLocal.obsp['distances'].toarray()
# Step 3: Apply K-Means clustering with a fixed number of clusters
print("Computing kmeans")
adataLocal.obs['kmeans_clusters'] = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(connectivities)
print("Storing kmeans classes \n")
adataLocal.obs["kmeans_clusters"].to_csv(filenameBackup, sep="\t")
adataLocal.X = adataLocal.layers["counts"].copy()
# Normalize counts
sc.pp.normalize_total(adataLocal, target_sum=2e4)
# Create combined AnnData objects efficiently and add to NewAdataList
# Step 2: Create dummy variables for clusters
dummy_clusters = pd.get_dummies(adataLocal.obs['kmeans_clusters'])
# Step 3: Dot product aggregation of expression values per cluster
# Each row of `cluster_aggregated_X` is the summed (pseudobulk) expression of a cluster;
# metacells are re-normalized downstream, so sums behave like means up to a scale factor
X_dense = adataLocal.X.A if hasattr(adataLocal.X, "A") else adataLocal.X
cluster_aggregated_X = dummy_clusters.T.dot(X_dense)
cluster_aggregated_median_X = np.zeros((dummy_clusters.shape[1], X_dense.shape[1]))
for cluster_idx in range(dummy_clusters.shape[1]):
# Select cells belonging to the current cluster
cluster_cells = X_dense[dummy_clusters.iloc[:, cluster_idx].values == 1]
# Compute the median across cells within the cluster
cluster_aggregated_median_X[cluster_idx, :] = np.median(cluster_cells, axis=0)
# Convert to AnnData structure
adataAggregated = ad.AnnData(X=cluster_aggregated_X)
adataAggregated.layers["median"] = cluster_aggregated_median_X # we save an additional layer having as aggregated value the median of the gene expression
adataAggregated.var_names = adataLocal.var_names
adataAggregated.obs['kmeans_clusters'] = dummy_clusters.columns
# Step 4: Aggregating labels and metadata
# Get the most common cell label for each cluster
adataAggregated.obs['AggregatedClass'] = (
adataLocal.obs.groupby('kmeans_clusters')['cell_class']
.agg(lambda x: x.value_counts().idxmax())
.reindex(dummy_clusters.columns)
.values
)
adataAggregated.obs['AggregatedLabel'] = (
adataLocal.obs.groupby('kmeans_clusters')['cell_label']
.agg(lambda x: x.value_counts().idxmax())
.reindex(dummy_clusters.columns)
.values
)
adataAggregated.obs_names = list(sample+"_"+adataAggregated.obs["kmeans_clusters"].astype(str))
# Assign metadata fields with identical values across each cluster
for obsMD in [col for col in adataLocal.obs.columns if len(adataLocal.obs[col].unique()) == 1]:
adataAggregated.obs[obsMD] = adataLocal.obs[obsMD].iloc[0]
# Add the aggregated AnnData to NewAdataList
NewAdataList.append(adataAggregated)
XDD358_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XDD359_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2191_Thalamus
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XHU297_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XDD358_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2061_Thalamus
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2073_Cerebellum
XDD385_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2114_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XDD342_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XDD385_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2061_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XHU292_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XDD351_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2073_Thalamus
BRC2006_Thalamus
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2061_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XHU307_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2006_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2191_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2191_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XDD385_Thalamus
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2147_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
XDD334_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2057_Thalamus
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2110_Cerebellum
XDD351_Cortex
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2021_Thalamus
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2114_Cerebellum
Retrieving precomputed kmeans classes
normalizing counts per cell
finished (0:00:00)
BRC2057_Cerebellum
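The per-cluster aggregation in the loop above relies on a dummy-matrix dot product. A minimal sketch with toy numbers (matrix and cluster assignments invented) shows the trick, including the division by cluster size that turns the sums into means:

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 4 cells x 2 genes, assigned to 2 clusters
X = np.array([[1.0, 3.0],
              [3.0, 5.0],
              [10.0, 0.0],
              [20.0, 2.0]])
clusters = pd.Series([0, 0, 1, 1])

# Cells x clusters indicator matrix; its transpose times X sums rows per cluster
dummies = pd.get_dummies(clusters)
summed = dummies.T.dot(X)                          # per-cluster sums (pseudobulk)
means = summed.div(dummies.sum(axis=0), axis=0)    # divide by cluster sizes -> means
print(means.values)
```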
Save¶
# Save the new aggregated anndata
from scipy import sparse
CombinedAdata = ad.concat(NewAdataList)
# Convert to sparse matrix
CombinedAdata.X = sparse.csr_matrix(CombinedAdata.X)
CombinedAdata.layers["median"] = sparse.csr_matrix(CombinedAdata.layers["median"])
print(adata.shape)
(45852, 15077)
print(CombinedAdata.shape)
(5200, 15077)
CombinedAdata.write_h5ad("./1_CombinedMetaCells.h5ad")
Reprocess¶
n_neighb = 40
n_pcs = 8
CombinedAdata = ad.concat(NewAdataList)
# Keep only aggregated labels supported by more than 100 metacells
AggregatedOBS = CombinedAdata.obs["AggregatedLabel"].value_counts().loc[CombinedAdata.obs["AggregatedLabel"].value_counts() > 100].index.tolist()
CombinedAdata = CombinedAdata[CombinedAdata.obs["AggregatedLabel"].isin(AggregatedOBS)]
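The `value_counts` filter above keeps only labels backed by enough metacells. The same pattern on toy labels (names and threshold invented for illustration):

```python
import pandas as pd

# Keep only labels supported by at least `min_cells` observations
labels = pd.Series(["RG"] * 5 + ["Neuron"] * 3 + ["Rare"] * 1)
min_cells = 3
counts = labels.value_counts()
keep = counts.loc[counts >= min_cells].index.tolist()
filtered = labels[labels.isin(keep)]
print(sorted(set(filtered)))
```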
import scvelo as scv
scv.tl.score_genes_cell_cycle(CombinedAdata)
calculating cell cycle phase
computing score 'S_score'
finished: added
'S_score', score of gene set (adata.obs).
637 total control genes are used. (0:00:01)
computing score 'G2M_score'
finished: added
'G2M_score', score of gene set (adata.obs).
475 total control genes are used. (0:00:00)
--> 'S_score' and 'G2M_score', scores of cell cycle phases (adata.obs)
sc.pp.normalize_total(CombinedAdata, target_sum=2e4)
sc.pp.log1p(CombinedAdata)
sc.pp.highly_variable_genes(CombinedAdata, n_top_genes=2000, flavor="seurat")
sc.pp.pca(CombinedAdata)
sc.pl.pca(CombinedAdata,color=["Donor_region","brain_region","Og_Age","Og_Chemistry"])
normalizing counts per cell
finished (0:00:00)
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
computing PCA
with n_comps=50
finished (0:00:00)
plotSankey(CombinedAdata, covs=["Donor_region","Og_Age","Og_Chemistry","Og_Subregion"])
HVGs = []
MinBatches = 2
for i in CombinedAdata.obs["Og_Chemistry"].unique():
    CombinedAdataKIT = CombinedAdata[CombinedAdata.obs["Og_Chemistry"] == i]
    sc.pp.highly_variable_genes(CombinedAdataKIT, n_top_genes=2000, flavor="seurat", batch_key="Og_Donor")
    kitHVGs = CombinedAdataKIT.var_names[CombinedAdataKIT.var["highly_variable_nbatches"] >= MinBatches].tolist()
    print(len(kitHVGs))
    HVGs.extend(kitHVGs)
print(len(set(HVGs)))
CombinedAdata.var["highly_variable"] = CombinedAdata.var_names.isin(HVGs)
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
2929
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
4067
4824
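The chemistry-wise selection keeps, per kit, genes called highly variable in at least `MinBatches` donors, using the `highly_variable_nbatches` column that scanpy adds when a `batch_key` is given. The filtering step on a toy `var` table (gene names and counts invented):

```python
import pandas as pd

# Toy var table mimicking scanpy's `highly_variable_nbatches` column
var = pd.DataFrame({"highly_variable_nbatches": [3, 1, 2, 0]},
                   index=["GENE1", "GENE2", "GENE3", "GENE4"])
min_batches = 2
# Keep genes highly variable in at least `min_batches` batches
hvgs = var.index[var["highly_variable_nbatches"] >= min_batches].tolist()
print(hvgs)
```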
sc.tl.pca(CombinedAdata)
sc.pl.pca_variance_ratio(CombinedAdata)
sc.pp.neighbors(CombinedAdata, n_neighbors=n_neighb, n_pcs=n_pcs, metric="euclidean")
sc.tl.umap(CombinedAdata)
computing PCA
with n_comps=50
finished (0:00:01)
computing neighbors
using 'X_pca' with n_pcs = 8
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:08)
Integration¶
import scanpy.external as sce
CombinedAdata = CleanAdata(CombinedAdata, obstokeep=CombinedAdata.obs.columns.tolist(), obsmtokeep="X_pca")
sce.pp.harmony_integrate(CombinedAdata, key="Og_Chemistry", max_iter_harmony=20)
2024-11-15 21:05:16,382 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-11-15 21:05:18,380 - harmonypy - INFO - sklearn.KMeans initialization complete.
2024-11-15 21:05:18,404 - harmonypy - INFO - Iteration 1 of 20
2024-11-15 21:05:19,173 - harmonypy - INFO - Iteration 2 of 20
2024-11-15 21:05:19,940 - harmonypy - INFO - Iteration 3 of 20
2024-11-15 21:05:20,704 - harmonypy - INFO - Iteration 4 of 20
2024-11-15 21:05:21,472 - harmonypy - INFO - Iteration 5 of 20
2024-11-15 21:05:22,236 - harmonypy - INFO - Iteration 6 of 20
2024-11-15 21:05:23,002 - harmonypy - INFO - Iteration 7 of 20
2024-11-15 21:05:23,766 - harmonypy - INFO - Iteration 8 of 20
2024-11-15 21:05:24,540 - harmonypy - INFO - Iteration 9 of 20
2024-11-15 21:05:25,067 - harmonypy - INFO - Iteration 10 of 20
2024-11-15 21:05:25,707 - harmonypy - INFO - Converged after 10 iterations
sc.pp.neighbors(CombinedAdata, n_neighbors=n_neighb, n_pcs=n_pcs, use_rep="X_pca_harmony")
sc.tl.umap(CombinedAdata)
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:08)
sc.tl.leiden(CombinedAdata, resolution=.8)
sc.tl.rank_genes_groups(CombinedAdata,groupby="leiden", method="wilcoxon")
running Leiden clustering
finished: found 16 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:02)
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:14)
sc.pl.umap(CombinedAdata, color=["leiden","Og_Subregion","AggregatedClass"],
size=10, wspace=.4, ncols=2, vmin='p1', vmax='p99',
add_outline=True, edges=False, edges_width=.01)
plotSankey(CombinedAdata, covs=["leiden","AggregatedClass"])
Diffusion maps and MAGIC imputation¶
import palantir
dm_res = palantir.utils.run_diffusion_maps(CombinedAdata, knn=n_neighb, pca_key="X_pca_harmony", n_components=n_pcs)
ms_data = palantir.utils.determine_multiscale_space(CombinedAdata)
palantir.utils.run_magic_imputation(CombinedAdata)
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
array([[0.02159585, 0.54069483, 0.02333469, ..., 0.06535586, 0.53239476,
0.57256068],
[0.02731917, 0.50445414, 0. , ..., 0.01208675, 0.1718003 ,
0.02736146],
[0.03031441, 0.69590421, 0.01278159, ..., 0.04684644, 0.47414945,
0.170866 ],
...,
[0.02344872, 0.65439312, 0.02709426, ..., 0.14571777, 0.43550936,
0.18831748],
[0.02160011, 0.47292763, 0.07538638, ..., 0.1829114 , 0.43291091,
0.68595744],
[0.01736921, 0.48722275, 0.06338858, ..., 0.18861574, 0.42059673,
0.68379031]])
import decoupler as dc
CombinedAdata.X = CombinedAdata.layers["MAGIC_imputed_data"].copy()
# Download database of regulons
net = dc.get_collectri(organism='human', split_complexes=False)
net
| source | target | weight | PMID | |
|---|---|---|---|---|
| 0 | MYC | TERT | 1 | 10022128;10491298;10606235;10637317;10723141;1... |
| 1 | SPI1 | BGLAP | 1 | 10022617 |
| 2 | SMAD3 | JUN | 1 | 10022869;12374795 |
| 3 | SMAD4 | JUN | 1 | 10022869;12374795 |
| 4 | STAT5A | IL2 | 1 | 10022878;11435608;17182565;17911616;22854263;2... |
| ... | ... | ... | ... | ... |
| 43173 | NFKB | hsa-miR-143-3p | 1 | 19472311 |
| 43174 | AP1 | hsa-miR-206 | 1 | 19721712 |
| 43175 | NFKB | hsa-miR-21-5p | 1 | 20813833;22387281 |
| 43176 | NFKB | hsa-miR-224-5p | 1 | 23474441;23988648 |
| 43177 | AP1 | hsa-miR-144 | 1 | 23546882 |
43178 rows × 4 columns
dc.run_ulm(
    mat=CombinedAdata,
    net=net,
    use_raw=False,
    source='source',
    target='target',
    weight='weight',
    verbose=True
)
Running ulm on mat with 5149 samples and 15077 targets for 636 sources.
acts = dc.get_acts(CombinedAdata, obsm_key='ulm_estimate')
acts
AnnData object with n_obs × n_vars = 5149 × 636
obs: 'kmeans_clusters', 'AggregatedClass', 'AggregatedLabel', 'Og_Subregion', 'Og_Donor', 'Og_Age', 'Og_Week', 'Og_Chemistry', 'run_id', 'brain_region', 'scDblFinder.classs', 'scDblFinder.class', 'Donor_region', 'S_score', 'G2M_score', 'phase', 'leiden'
uns: 'Donor_region_colors', 'brain_region_colors', 'Og_Chemistry_colors', 'neighbors', 'umap', 'leiden', 'rank_genes_groups', 'leiden_colors', 'Og_Subregion_colors', 'AggregatedClass_colors', 'DM_EigenValues'
obsm: 'X_pca', 'X_pca_harmony', 'X_umap', 'DM_EigenVectors', 'DM_EigenVectors_multiscaled', 'ulm_estimate', 'ulm_pvals'
acts.obs.AggregatedClass.unique().tolist()
['Radial_glia_Cortex_Pre', 'Neuron_Cortex_Pre', 'Neuroblast_Cortex_Pre', 'Glioblast_Cortex_Pre', 'Neuronal_IPC_Cortex_Pre', 'Neuronal_IPC_Thalamus_Pre', 'Neuron_Thalamus_Pre', 'Radial_glia_Thalamus_Pre', 'Neuroblast_Thalamus_Pre', 'Neuron_Cerebellum_Pre', 'Neuroblast_Cerebellum_Pre', 'Glioblast_Cerebellum_Pre', 'Neuronal_IPC_Cerebellum_Pre', 'Radial_glia_Cerebellum_Pre', 'Glioblast_Thalamus_Pre']
acts_RG = acts[acts.obs["AggregatedClass"].isin(["Radial_glia_Cortex_Pre","Radial_glia_Thalamus_Pre","Radial_glia_Cerebellum_Pre"])].copy()
df = dc.rank_sources_groups(acts_RG, groupby='AggregatedClass', reference="rest", method='t-test_overestim_var')
n_markers = 10
source_markers = df.groupby('group').head(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markers
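The `source_markers` dict above is built with a double groupby: the first `head(n_markers)` keeps the top-ranked rows per group (the output of `rank_sources_groups` is already sorted within each group), and the second collects them into lists. A minimal sketch with invented groups and TF names:

```python
import pandas as pd

# Toy ranked results, pre-sorted within each group as rank_sources_groups returns them
df = pd.DataFrame({
    "group": ["g1", "g1", "g1", "g2", "g2", "g2"],
    "names": ["TF1", "TF2", "TF3", "TF4", "TF5", "TF6"],
})
n_markers = 2
# Top `n_markers` names per group, as {group: [names]}
source_markers = (df.groupby("group").head(n_markers)
                    .groupby("group")["names"].apply(list).to_dict())
print(source_markers)
```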
sc.tl.dendrogram(acts_RG, groupby="AggregatedClass")
sc.pl.matrixplot(acts_RG, source_markers, 'AggregatedClass', dendrogram=True, standard_scale='var',
colorbar_title='Z-scaled scores', cmap='RdBu_r')
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_AggregatedClass']`
acts_NEU = acts[acts.obs["AggregatedClass"].isin(["Neuron_Cortex_Pre","Neuron_Thalamus_Pre","Neuron_Cerebellum_Pre"])].copy()
df = dc.rank_sources_groups(acts_NEU, groupby='AggregatedClass', reference="rest", method='t-test_overestim_var')
n_markers = 10
source_markers = df.groupby('group').head(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markers
sc.tl.dendrogram(acts_NEU, groupby="AggregatedClass")
sc.pl.matrixplot(acts_NEU, source_markers, 'AggregatedClass', dendrogram=True, standard_scale='var',
colorbar_title='Z-scaled scores', cmap='RdBu_r')
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_AggregatedClass']`
sc.pl.rank_genes_groups(CombinedAdata, n_genes=25, ncols=4, fontsize=8)